Topological Grammars for Data Approximation 



A.N. Gorban^'^* N.R. Sumner^j and A.Y. Zinovyev^'^^ 
^University of Leicester, Leicester, LEI 7RH, UK 
^Institut Curie, rue d'Ulm, Paris, 75248, France 
^Institute of Computational Modeling SB RAS, Krasnoyarsk, Russia 



Abstract 

A method of topological grammars is proposed for multidimensional 
data approximation. For data with complex topology we define a princi- 
pal cubic complex of low dimension and given complexity that gives the 
best approximation for the dataset. This complex is a generalization of 
linear and non-linear principal manifolds and includes them as particular 
cases. The problem of optimal principal complex construction is trans- 
formed into a series of minimization problems for quadratic functionals. 
These quadratic functionals have a physically transparent interpretation 
in terms of elastic energy. For the energy computation, the whole complex 
is represented as a system of nodes and springs. Topologically, the princi- 
pal complex is a product of one-dimensional continuums (represented by 
graphs), and the grammars describe how these continuums transform dur- 
ing the process of optimal complex construction. This factorization of the 
whole process onto one-dimensional transformations using minimization 
of quadratic energy functionals allow us to construct efficient algorithms. 

Keywords: Principal component, Principal manifold. Graph grammar. Cubic 
complex. Elastic energy, Dataset, Approximation 



1 Introduction 

In this paper, we discuss a classical problem: how to approximate a finite set 
D in i?™ for relatively large m by a finite subset of a regular low- dimensional 
object in i?™. In application, this finite set is a dataset, and this problem arises 
in many areas: from data visualization to fluid dynamics. 

The first hypothesis we have to check is: whether the dataset D is situated 
near a low-dimensional affine manifold (plane) in i?™. If we look for a point, 
straight line, plane, ... that minimizes the average squared distance to the dat- 
apoints, we immediately come to the Principal Component Analysis (PCA). 
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PCA is one of the most seminal inventions in data analysis. Now it is text- 
book material. Nonlinear generalization of PCA is a great challenge, and many 
attempts have been made to answer it. Two of them are especially important 
for our consideration: Kohonen's Self-Organizing Maps (SOM) and principal 
manifolds. 

With the SOM algorithm [T] we take a finite metric space Y with metric p 
and try to map it into R™ with (a) the best preservation of initial structure in 
the image of Y and (b) the best approximation of the dataset D. The SOM 
algorithm has several setup variables to regulate the compromise between these 
goals. We start from some initial approximation of the map, : M — > i?™. 
On each (fc-th) step of the algorithm we have a datapoint x G D and a current 
approximation 0^ : M — > i?™. For these x and (j}k we define an "owner" of x in 
Y: j/x = argmin.ygy — (j)k{y)\\. The next approximation, (f>k+i, is 

0fc+i(2/) ^ (t)k{y) + hkw{p{y,yx)){x - (j)k{y))- (1) 

Here hk is a step size, < w{p{y, yx)) < 1 is a monotonically decreasing cutting 
function. There are many ways to combine steps in the whole algorithm. 
The idea of SOM is very flexible and seminal, has plenty of applications and 
generalizations, but, strictly speaking, we don't know what we are looking for: 
we have the algorithm, but no independent definition: SOM is a result of the 
algorithm work. The attempts to define SOM as solution of a minimization 
problem for some energy functional were not very successful |3j. 

For a known probability distribution, principal manifolds were introduced 
as lines or surfaces passing through "the middle" of the data distribution [2]. 
This intuitive vision was transformed into the mathematical notion of self- 
consistency: every point x of the principal manifold M is a conditional expecta- 
tion of all points z that are projected into x. Neither manifold, nor projection 
should be linear: just a differentiable projection tt of the data space (usually it is 
i?™ or a domain in i?™) onto the manifold M with the self-consistency require- 
ment for conditional expectations: x — E(z|7r(z) = x). For a finite dataset D, 
only one or zero datapoints are typically projected into a point of the principal 
manifold. In order to avoid overfitting, we have to introduce smoothers that 
become an essential part of the principal manifold construction algorithms. 

SOMs give the most popular approximations for principal manifolds: we can 
take for Y a fragment of a regular fc-dimensional grid and consider the resulting 
SOM as the approximation to the fc-dimensional principal manifold (see, for 
example, ^EJ)- Several original algorithms for construction of principal curves 
|S] and surfaces for finite datasets were developed during last decade, as well as 
many applications of this idea. In 1996, in a discussion about SOM at the 5th 
Russian National Seminar in Neuroinformatics, a method of multidimensional 
data approximation based on elastic energy minimization was proposed (see [7| 
ISlini and the bibliography there). This method is based on the analogy between 
the principal manifold and the elastic membrane (and plate). Following the 
metaphor of elasticity, we introduce two quadratic smoothness penalty terms. 
This allows one to apply standard minimization of quadratic functionals (i.e., 
solving a system of linear algebraic equations with a sparse matrix) . 
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2 Graph grammars and principal graphs 



Let G be a simple undirected graph with set of vertices Y and set of edges E. 
For k > 2 a, fc-star in G is a subgraph with fc + 1 vertices yo,i,...k G Y and 
k edges {{yo,yi) \ i — I, ■ ■ ■ k} C E. Suppose for each > 2, a family Sk of 
fc-stars in G has been selected. We call a graph G with selected families of 
fc-stars Sk an elastic graph if, for all E^'^'> G E and s'f^^ S Sk, the correspondent 
elasticity moduli > and ^Xkj > are defined. Let E'-^^O) , E'^^^1) be vertices 



of an edge andS'^^^(O), 



'^k^ (0) is the central vertex) . For any map (j) : Y ^ i?'" the energy of the graph 
is defined as 



. sl^\k) be vertices of a fc-star 5j. 



U) 



'among them, 



U^{G) := ^AJU(i?«(O))-0(£;«(l)) 



EM 



(2) 



Very recently, a simple but important fact was noticed |10j : every system of 
elastic finite elements could be represented by a system of springs, if we allow 
some springs to have negative elasticity coefficients. The energy of a fc-star Sk 
in i?™ with yo in the center and fc endpoints yi^...k is Us^. = {J2'i=i Vi ~ kyoY , 
or, in the spring representation, u^^ = k^^^ ZLi (y» ~2/o)^ -/^s^ Y.i>Ayi~y]f- 
Here we have fc positive springs with coefficients fc/is^ and fc(fc — negative 
springs with coefficients — //g^ . 

For a given map (f) -.Y ^ R™ we divide the dataset D into subsets K^, y eY . 
The set contains the data points for which the node (f){y) is the closest one 
in 4>{Y). The energy of approximation is: 



EE 



w{x)\\x - (j){y)\\ 



(3) 



where w{x) > are the point weights. 

The simple algorithm for minimization of the energy If^ = U'^{G,D) + 
U'^{G) is the splitting algorithm, in the spirit of the classical fc-means clustering: 
for a given system of sets {K^ \ y G Y} we minimize (it is the minimization 
of a positive quadratic functional), then for a given cj) we find new {K^}, and 
so on; stop when no change. This algorithm gives a local minimum, and the 
global minimization problem arises. There are many methods for improving the 
situation, but without guarantee of the global minimization. 

The next problem is the elastic graph construction. Here we should find a 
compromise between simplicity of graph topology, simplicity of geometrical form 
for a given topology, and accuracy of approximation. Geometrical complexity 
is measured by the graph energy U'^{G), and the error of approximation is 
measured by the energy of approximation U'^{G,D). Both are included in the 
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Figure 1 : Applying a simple "add a node to a node or bisect an edge" grammar 
to construct principal elastic trees (one node is added per iteration). Upper 
row: an example of two-dimensional branching distribution of points. Lower 
row: the classical benchmark, the "iris" four-dimensional dataset (point shapes 
distinguish three classes of points), the dataset and principal tree are presented 
in projection onto the plane of first two principal components. 



energy [/"^. Topological complexity will be represented by means of elementary 
transformations: it is the length of the energetically optimal chain of elementary 
transformation from a given set applied to initial simple graph. 

Graph grammars |lllll2j provide a well-developed formalism for the descrip- 
tion of elementary transformations. An elastic graph grammar is presented as 
a set of production (or substitution) rules. Each rule has a form A B, where 
A and B are elastic graphs. When this rule is applied to an elastic graph, a 
copy of A is removed from the graph together with all its incident edges and 
is replaced with a copy of B with edges that connect B to graph. For a full 
description of this language we need the notion of a labeled graph. Labels are 
necessary to provide the proper connection between B and the graph. 

A link in the energetically optimal transformation chain is constructed by 
finding a transformation application that gives the largest energy descent (af- 
ter an optimization step), then the next link, and so on, until we achieve the 
desirable accuracy of approximation, or the limit number of transformations 
(some other termination criteria are also possible) . The selection of an energet- 
ically optimal application of transformations by the trial optimization steps is 
time-consuming. There exist alternative approaches. The preselection of appli- 
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Figure 2: The Cartesian product of graphs. 

cations for a production rule A ^ B can be done through comparison of energy 
of copies of A with its incident edges and stars in the transformed graph G. 

As the simple (but already rather powerful) example we use a system of 
two transformations: "add a node to a node" and "bisect an edge." These 
transformations act on a class of primitive elastic graphs: all non-terminal nodes 
with k edges are centers of elastic k-stars, which form all the fc-stars of the 
graph. For a primitive elastic graph, the number of stars is equal to the number 
of non-terminal nodes - the graph topology prescribes the elastic structure. 

The transformation "add a node" can be applied to any vertex y of G: add 
a new node z and a new edge (?/, z). The transformation "bisect an edge" is ap- 
plicable to any pair of graph vertices y,?/' connected by an edge {y,y'): Delete 
edge (y, y'), add a vertex z and two edges, (y, z) and (z, y'). The transformation 
of elastic structure (change in the star list) is induced by the change of topology, 
because the elastic graph is primitive. This two-transformation grammar with 
energy minimization builds principal trees (and principal curves, as a particular 
case) for datasets. A couple of examples are presented on Fig. ^ For applica- 
tions, it is useful to associate one-dimensional continuums with these principal 
trees. Such a continuum consists of node images 4>{y) and of pieces of straight 
lines that connect images of linked nodes. 

3 Factorization and transformation of factors 

If we approximate multidimensional data by a fc-dimensional object, the num- 
ber of points (or, more general, elements) in this object grows with k expo- 
nentially. This is an obstacle for grammar-based algorithms even for modest 
k, because for analysis of the rule A — > _B applications we should investigate 
all isomorphic copies of A in G. The natural way to avoid this obstacle is the 
principal object factorization. Let us represent an elastic graph as a Carte- 
sian product of graphs (Fig. |21). Cartesian products Gi x . . . x of elastic 
graphs Gi, . . .Gr is an elastic graph with the vertex set Vi x . . . x Vr- Let 
1 < i < r and Vj 6 Vj (j ^ i). For this set of vertices, {vj}j^i, a copy of Gi in 
Gi X . . . X Gr is defined with vertices (wi, . . . v, w^+i, . . . Vr) {v G Vi), edges 
((«!,. . .Vi-i,v,Vi+i, . ..Vr), {vi, . ..Vi-i,v',Vi+i, . . . Vr)) i{v,v') S Ei) and, simi- 
larly, fc-stars of the form (ui, . . . Vi-i, 5*^, w^+i, . . . Vr), where Sk is a /c-star in Gj. 
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For any d there are j^i l^j I copies of Gt in G. Sets of edges and fc-stars for 
Cartesian product are unions of that set through all copies of all factors. A map 
(j) : Vi X . . . X Vr ^ i?™ maps all the copies of factors into i?™ too. Energy of 
the elastic graph product is the energy sum of all factor copies. It is, of course, 
a quadratic functional of 4>. 

The only difference between the construction of general elastic graphs and 
factorized graphs is in application of transformations. For factorized graphs, we 
apply them to factors. This approach significantly reduces the amount of trials 
in selection of optimal application. The simple grammar with two rules, "add 
a node to a node or bisect an edge," is also powerful here, it produces products 
of primitive elastic trees. For such a product, the elastic structure is defined by 
the topology of the factors. 

4 Conclusion: adaptive dimension and principal 
cubic complexes 

In the continuum representation, factors are one-dimension continuums, hence, 
a product of r factors is represented as an r-dimensional cubic complex J3] 
that is glued together from r-dimensional parallelepipeds ("cubes"). Thus, the 
factorized principal elastic graphs generate a new and, as we can estimate now, a 
useful construction: a principal cubic complex. One of the obvious benefits from 
this construction is adaptive dimension: the grammar approach with energy 
optimization develops the necessary number of non-trivial factors, and not more. 
These complexes can approximate multidimensional datasets with complex, but 
still low-dimensional topology. The topology of the complex is not prescribed, 
but adaptive. In that sense, they are even more flexible than SOMs. The whole 
approach can be interpreted as a intermediate between absolutely flexible neural 
gas |14| and significantly more restrictive elastic map jH]. It includes as simple 
limit cases the /c-means clustering algorithm (low elasticity moduli) and classical 
PCA (high ^ for 5*2 and ^ oo for Sk, k > 2). 
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